The dataset(Hu, Jia, and Wang 2023) I will be focusing on is on adult-onset Still’s disease (AOSD) which is a rare autoinflammatory disease. The most serious symptom resulting from AOSD is macrophage activation syndrome (MAS). To date, there is limited research on AOSD patients with MAS and biomarkers that could be used to detect it. Thus, the study(Jia et al. 2023) aims to contribute to this research area by comparing healthy controls to AOSD patients with MAS (AOSD-MAS). The dataset used is GSE247993 which was downloaded from GEO. In assignment 1, this dataset was first filtered to remove genes with low reads in which 74% of the genes survived (12404 were kept and 4280 were removed from the count data). The dataset was then normalized using TMM normalization using the edgeR package.
In the second assignment, we performed differential expression analysis on our cleaned and normalized count data using EdgeR. Then, we performed thresholded ORA through g:Profiler. Here, we will be taking the genes and ranking them based on amount of differential expression. Afterwards we will use GSEA to conduct non-thresholded enrichment analysis and visualize through Cytoscapee.
In assignment 2, we used quasi likelihood models to fit the data and QL F-test to test for differential expression
Here we will be conducting gene set enrichment analysis (GSEA) which is a non-thresholded analysis
if(!requireNamespace('BiocManager', quietly = TRUE)){
install.packages('BiocManager')}
if(!requireNamespace('htmltools', quietly = TRUE)){
install.packages('htmltools')}
if(!requireNamespace('dplyr', quietly = TRUE)){
tidyr::install("dplyr")}
if(!requireNamespace("kableExtra", quietly = TRUE)){
install.packages("kableExtra")}
if (!requireNamespace("fgsea", quietly = TRUE)){
install.packages("fgsea")}
tryCatch(expr = { library("RCurl")},
error = function(e) {
install.packages("RCurl")},
finally = library("RCurl"))
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("RCy3")
tryCatch(expr = { library("httr")},
error = function(e) { BiocManager::install("httr")},
finally = library("httr"))
library(htmltools)
library(dplyr)
library(kableExtra)
library(fgsea)
library(RCurl)
Package reference: (Cheng et al. 2023),(Wickham et al. 2023),(Xie 2023),(Zhu 2021),(Korotkevich, Sukhov, and Sergushichev 2019),(Temple Lang 2023)
load("qlf_output_hits_AOSDMAS.RData") #hits from qlf test
Below are the first 5 rows from the qlf test
qlf_output_hits_AOSDMAS$table[1:5,]
## logFC logCPM F PValue FDR
## DYSF -1.5144801 8.750271 72.40479 4.428107e-08 0.0003542958
## PIM1 -1.6794151 7.310160 66.67071 8.473289e-08 0.0003542958
## FAM20A -4.6289121 4.811680 66.11572 9.043350e-08 0.0003542958
## KREMEN1 -1.7076535 5.048758 64.02591 1.160104e-07 0.0003542958
## BACE1 0.8800601 4.714982 62.15092 1.458483e-07 0.0003563367
Sort QLF test output by rank
#Compute rank by p-value and sign
qlf_output_hits_AOSDMAS$table[,"rank"] <- -log(qlf_output_hits_AOSDMAS$table$PValue, base = 10) *sign(qlf_output_hits_AOSDMAS$table$logFC)
#Sort by ranking in descending order
ranked_hits <- as.data.frame(qlf_output_hits_AOSDMAS[order(-qlf_output_hits_AOSDMAS$table$rank),])
#See snippet of rank sorted table
ranked_hits[1:10,]
## logFC logCPM F PValue FDR rank
## BACE1 0.8800601 4.714982 62.15092 1.458483e-07 0.0003563367 6.836099
## TOP1MT 0.9430445 4.001077 57.90244 2.500136e-07 0.0005090277 6.602036
## ENG 0.9559476 6.305448 48.53399 9.200027e-07 0.0012487503 6.036211
## CDIP1 1.0724350 4.947028 42.71557 2.273028e-06 0.0016568907 5.643395
## NAT9 0.5916054 4.673682 39.97362 3.588454e-06 0.0017534621 5.445093
## PCDH12 1.1253732 2.275230 37.74715 5.283619e-06 0.0023139709 5.277068
## PHF20 0.5316917 5.904806 37.52804 5.493218e-06 0.0023139709 5.260173
## PRDM11 1.3095216 2.710128 33.59850 1.134203e-05 0.0040751236 4.945309
## CRYBG3 0.5481675 4.776981 32.61084 1.372774e-05 0.0041435433 4.862401
## PID1 1.6729404 5.325294 32.47423 1.409916e-05 0.0041435433 4.850807
#Save ranked_hits
AOSDMAS_ranks <- cbind(rownames(ranked_hits),ranked_hits$rank)
colnames(AOSDMAS_ranks) <- c('GeneNames', 'Rank')
download_dir = file.path(getwd())
if(!file.path(download_dir,"ranked_hits.rnk")==FALSE){
write.table(AOSDMAS_ranks, "ranked_hits.rnk",
quote=FALSE,
sep='\t',
row.names=FALSE)
}
In order to run GSEA we must load in the Bader gene set.
geneset_file = "Human_GOBP_AllPathways_withPFOCR_no_GO_iea_April_01_2024_symbol.gmt"
# Download gene set if it doesn't exist in the directory
if(!file.exists(geneset_file)){
# store URL to get most recent gene set
geneset_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# connect to UR: and get html version of page
filenames = getURL(geneset_url)
tc = textConnection(filenames)
contents = readLines(tc)
# Create regular expression to get gmt with all files and not including terms from electronic annotation
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_noPFOCR_no_GO_iea.*.)(.gmt)(?=\">)", contents,perl = TRUE)
# get files that satisfy the regular expression 'rx'
gmt_file = unlist(regmatches(contents, rx))
# File path to download file
geneset_filepath <- file.path(download_dir, gmt_file)
# Download file
download.file(paste(geneset_url, gmt_file, sep = ''), destfile = geneset_filepath)
}
Now, we can run GSEA. The following code is adapted from this tutorial
#run_gsea - true/false
# This parameter is for the compilation of the notebook.
run_gsea <- TRUE
if(run_gsea){
#Get the ranks of genes
ranks = ranked_hits$rank
names(ranks) = rownames(ranked_hits)
#Parse geneset file and ranks into fgsea
fgsea_output <- fgsea(pathways = fgsea::gmtPathways(geneset_filepath),
stats = ranks,
minSize = 15,
maxSize = 500)
}
# Check if the above worked
head(fgsea_output)
## pathway
## 1: 'DE NOVO' POST-TRANSLATIONAL PROTEIN FOLDING%GOBP%GO:0051084
## 2: 'DE NOVO' PROTEIN FOLDING%GOBP%GO:0006458
## 3: 10Q11 21Q11 23 COPY NUMBER VARIATION SYNDROME%WIKIPATHWAYS_20240310%WP5352%HOMO SAPIENS
## 4: 10Q22Q23 COPY NUMBER VARIATION%WIKIPATHWAYS_20240310%WP5402%HOMO SAPIENS
## 5: 11P11 2 COPY NUMBER VARIATION SYNDROME %WIKIPATHWAYS_20240310%WP5348%HOMO SAPIENS
## 6: 13Q12 12 COPY NUMBER VARIATION%WIKIPATHWAYS_20240310%WP5406%HOMO SAPIENS
## pval padj log2err ES NES size
## 1: 0.6127321 0.8363711 0.04929177 0.3141686 0.9141703 30
## 2: 0.6147860 0.8377383 0.04811840 0.3018334 0.9022431 34
## 3: 0.1768531 0.4194903 0.11524000 0.4117029 1.2364095 35
## 4: 0.1594595 0.3945190 0.12503337 0.4442195 1.2669248 27
## 5: 0.3304721 0.6027206 0.15851411 -0.3144822 -1.0839712 35
## 6: 0.6690442 0.8728559 0.04860342 0.3358254 0.8594298 17
## leadingEdge
## 1: ST13,HSPA1A,HSPA1B
## 2: ST13,HSPA1A,HSPA1B
## 3: SF3B1,MAPK8,ARHGAP22,HSF1,ERCC5,HSF4,...
## 4: GLUD1,GPR15,ZMYND11,PRXL2A,BMPR1A,SIRT4,...
## 5: CD82,CD63,LGALS3BP
## 6: MAPK8,SGCB,DAG1,SACS,TRAF5
Next, we will create a function to reformat the fGSEA results to GSEA report files
gsea_format <- function(fgsea_output, ranks){
# find max rank
max_rank <- apply(fgsea_output, MARGIN = 1, FUN = function(x){
max(which(names(fgsea_output) %in% unlist(x[8])))
})
# store information in the following table that follows the format of a GSEA report file
gsea_output = cbind(fgsea_output$pathway,
fgsea_output$pathway,
"Details ...",
fgsea_output$size,
fgsea_output$ES,
fgsea_output$NES,
fgsea_output$pval,
fgsea_output$padj,
0,
max_rank,
"Not available")
# Rename columns
colnames(gsea_output) <- c("name", "description", "GS DETAILS", "SIZE", "ES",
"NES", "pval", "padj", "FWER", "Rank at Max",
"leading edge genes")
# Return the gsea_output
return(gsea_output)
}
Use the above function to store positively and negatively ranked genes to use for cytoscape in the latter portion of this assignment
# Store the positively and negatively enriched genes and all enriched gene
pos_geneset = gsea_format(fgsea_output[fgsea_output$NES > 0],ranks)
neg_geneset = gsea_format(fgsea_output[fgsea_output$NES < 0], ranks)
enriched_geneset = gsea_format(fgsea_output, ranks)
# Save the positively and negatively enriched genes in tsv files for Cytoscape
write.table(pos_geneset,
file = "gsea_report_for_na_pos.tsv",
row.names = FALSE,
quote = FALSE)
write.table(neg_geneset,
file = "gsea_report_for_na_neg.tsv",
row.names = FALSE,
quote = FALSE)
fgsea_output[NES > 0][order(fgsea_output[NES > 0]$padj), ][1:20, c("pathway", "padj")] %>% kable(caption = 'Table 1. Top 20 enriched upregulated gene sets in AOSDMAS patients', digits = 30) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
| pathway | padj |
|---|---|
| EUKARYOTIC TRANSLATION ELONGATION%REACTOME%R-HSA-156842.4 | 1.980135e-19 |
| RRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME DATABASE ID RELEASE 81%8868773 | 1.980135e-19 |
| RRNA PROCESSING%REACTOME DATABASE ID RELEASE 81%72312 | 1.980135e-19 |
| MAJOR PATHWAY OF RRNA PROCESSING IN THE NUCLEOLUS AND CYTOSOL%REACTOME%R-HSA-6791226.5 | 1.007309e-18 |
| PEPTIDE CHAIN ELONGATION%REACTOME%R-HSA-156902.4 | 1.007309e-18 |
| PROTEIN SYNTHESIS: ALANINE%PATHWHIZ%PW101384 | 1.007309e-18 |
| PROTEIN SYNTHESIS: ARGININE%SMPDB%SMP0111853 | 1.007309e-18 |
| PROTEIN SYNTHESIS: ASPARAGINE%SMPDB%SMP0111854 | 1.007309e-18 |
| PROTEIN SYNTHESIS: ASPARTIC ACID%SMPDB%SMP0111858 | 1.007309e-18 |
| PROTEIN SYNTHESIS: CYSTEINE%PATHWHIZ%PW112918 | 1.007309e-18 |
| PROTEIN SYNTHESIS: GLUTAMIC ACID%PATHWHIZ%PW112922 | 1.007309e-18 |
| PROTEIN SYNTHESIS: GLUTAMINE%SMPDB%SMP0111862 | 1.007309e-18 |
| PROTEIN SYNTHESIS: GLYCINE%PATHWHIZ%PW112928 | 1.007309e-18 |
| PROTEIN SYNTHESIS: HISTIDINE%PATHWHIZ%PW112929 | 1.007309e-18 |
| PROTEIN SYNTHESIS: ISOLEUCINE%SMPDB%SMP0111872 | 1.007309e-18 |
| PROTEIN SYNTHESIS: LEUCINE%SMPDB%SMP0111873 | 1.007309e-18 |
| PROTEIN SYNTHESIS: LYSINE%SMPDB%SMP0111874 | 1.007309e-18 |
| PROTEIN SYNTHESIS: METHIONINE%PATHWHIZ%PW112933 | 1.007309e-18 |
| PROTEIN SYNTHESIS: PHENYLALANINE%PATHWHIZ%PW112934 | 1.007309e-18 |
| PROTEIN SYNTHESIS: PROLINE%PATHWHIZ%PW113695 | 1.007309e-18 |
A snippet of Table 1. shows that mostly genes related to RNA processing and protein synthesis are enriched in AOSDMAS patients.
fgsea_output[NES < 0][order(fgsea_output[NES < 0]$padj), ][1:25, c("pathway", "padj")] %>% kable(caption = 'Table 2. Top 20 enriched downregulated gene sets in healthy patients', digits = 30) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
| pathway | padj |
|---|---|
| HALLMARK_INTERFERON_GAMMA_RESPONSE%MSIGDBHALLMARK%HALLMARK_INTERFERON_GAMMA_RESPONSE | 0.000000e+00 |
| DEFENSE RESPONSE TO OTHER ORGANISM%GOBP%GO:0098542 | 0.000000e+00 |
| HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDBHALLMARK%HALLMARK_INTERFERON_ALPHA_RESPONSE | 0.000000e+00 |
| DEFENSE RESPONSE TO SYMBIONT%GOBP%GO:0140546 | 1.330000e-27 |
| NEUTROPHIL DEGRANULATION%REACTOME DATABASE ID RELEASE 81%6798695 | 2.436700e-23 |
| INNATE IMMUNE RESPONSE%GOBP%GO:0045087 | 9.286229e-21 |
| RESPONSE TO BACTERIUM%GOBP%GO:0009617 | 5.994325e-20 |
| RESPONSE TO VIRUS%GOBP%GO:0009615 | 2.973521e-18 |
| HALLMARK_INFLAMMATORY_RESPONSE%MSIGDBHALLMARK%HALLMARK_INFLAMMATORY_RESPONSE | 3.930417e-18 |
| HALLMARK_TNFA_SIGNALING_VIA_NFKB%MSIGDBHALLMARK%HALLMARK_TNFA_SIGNALING_VIA_NFKB | 1.234689e-17 |
| RESPONSE TO CYTOKINE%GOBP%GO:0034097 | 5.329018e-17 |
| ANTIGEN PROCESSING-CROSS PRESENTATION%REACTOME DATABASE ID RELEASE 81%1236975 | 5.534476e-17 |
| REGULATION OF RESPONSE TO BIOTIC STIMULUS%GOBP%GO:0002831 | 5.534476e-17 |
| INTERFERON SIGNALING%REACTOME%R-HSA-913531.4 | 9.374199e-17 |
| REGULATION OF INNATE IMMUNE RESPONSE%GOBP%GO:0045088 | 1.584857e-16 |
| ER-PHAGOSOME PATHWAY%REACTOME DATABASE ID RELEASE 81%1236974 | 2.321175e-16 |
| POSITIVE REGULATION OF RESPONSE TO EXTERNAL STIMULUS%GOBP%GO:0032103 | 5.492647e-16 |
| HALLMARK_COMPLEMENT%MSIGDBHALLMARK%HALLMARK_COMPLEMENT | 9.391929e-16 |
| SIGNALING BY INTERLEUKINS%REACTOME%R-HSA-449147.13 | 1.684306e-15 |
| DEFENSE RESPONSE TO BACTERIUM%GOBP%GO:0042742 | 2.414913e-14 |
| INTERFERON ALPHA BETA SIGNALING%REACTOME DATABASE ID RELEASE 81%909733 | 3.157939e-14 |
| POSITIVE REGULATION OF DEFENSE RESPONSE%GOBP%GO:0031349 | 1.181926e-13 |
| DEFENSE RESPONSE TO VIRUS%GOBP%GO:0051607 | 1.669415e-13 |
| CELLULAR RESPONSE TO CYTOKINE STIMULUS%GOBP%GO:0071345 | 2.165669e-13 |
| INTERFERON GAMMA SIGNALING%REACTOME%R-HSA-877300.7 | 3.142044e-13 |
Table 2. shows that in healthy patients, most of the enriched genes are related to the immune system. ## Questions 1. What method did you use? What genesets did you use? Make sure to specify versions and cite your methods. I used the fgsea R package (10?) to conduct the non-thresholded analysis using the GSEA v4.3.3 (Subramanian2005-fy?).
Human_GOBP_AllPathways_noPFOCR_no_GO_iea_April_01_2024_symbol.gmt
gene set was used from the Bader lab that I downloaded through R.
AOSDMAS_gs <- nrow(fgsea_output)
AOSDMAS_enriched <- nrow(fgsea_output[NES > 0])
AOSDMAS_signif <- nrow(fgsea_output[NES > 0][padj < 0.05])
healthy_enriched <- nrow(fgsea_output[NES < 0])
healthy_signif <- nrow(fgsea_output[NES < 0][padj < 0.05])
group <- c('AOSDMAS', 'Healthy')
enrich <- c(AOSDMAS_enriched,healthy_enriched)
signif <- c(AOSDMAS_signif, healthy_signif)
tabl <- data.frame(cbind(group, enrich, signif))
colnames(tabl) <- c('Group', 'Number of enriched genesets', 'Number of significant genesets')
There were 5606 number of genesets in total from GSEA anlysis.
tabl %>% kable(caption = 'Table 4. Summary of enrichment data', digits = 30) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
| Group | Number of enriched genesets | Number of significant genesets |
|---|---|---|
| AOSDMAS | 2624 | 145 |
| Healthy | 2982 | 804 |
In the ORA we saw that in AOSDMAS patients, there were many genes sets related to the immune system or reactions which agrees with our results in here using non-thresholded analysis. In both ORA and GSEA, we also see that pathways related to protein polymerization were also upregulated in AOSDMAS patients. However, we see more terms related to antigen presentation in ORA compared to GSEA. All together, these results also agree with the biological themes brought up in the original paper.
The comparison however, may not be so straightforward because they fundamentally use different pathways to analyze the genes. Also, in thresholded analysis, there is information that might have been lost because of the nature of providing a threholded for genesets.
1020 nodes 10952 edges Node cutoff = 0.1 Edge cutoff = 0.375 P-value = 1.0 FDR Q-value = 0.001
Figure 1. Enrichment map of genes
before manual layout
Figure 2. Annotated enrichment
map
I selected the ‘Layout network to prevent cluster overlap’ option. I kept all other default settings as shown below: - Cluster algorithm = MCL Cluster - Edge weight column - None
Figure 3. Settings used
to annotate enrichment map
Make a publication ready figure - include this figure with proper
legends in your notebook. Figure
4.Enrichment map of gene sets from AOSDMAS patients compared to healthy
patients analyzed by GSEA. Red nodes represents genesets that were
enriched in AOSDMAS patients and blue represents genesets enriched in
healthy patients.
Collapse your network to a theme network. What are the major themes present in this analysis? Do they fit with the model? Are there any novel pathways or themes?
Figure 5. Collapsed
enrichment map showing major biological themes.
From Figure 5. we again see many enriched genes in healthy patients that are related to inflammation responses, which are the blue nodes. We also see on the right of Figure 5, two major themes that we see in the GSEA analysis where pathways related to RNA processing and protein synthesis is enriched in AOSDMAS patients.
In the paper, they studied the most serior complication of AOSD which is macrophage activation syndrome that is initated by the activation of immune system and a cytokine storm. As mentioned previously, we have seen in both the ORA and GSEA that pathways related to immune system activation is enriched and upregulated in patients with AOSDMAS which supports the mechanism discussed in the paper. I also saw that in our enrichment analyses that surface markers of monocytes were enriched which is also found in the paper. These surface markers have costimulatory behaviour that enhances each other’s activities and are mostly found on intermediate monocytes.
In the paper’s GO enrichment analysis, they revealed genes that were enriched in the AOSDMAS group which are involved in defense response to virus and cytokine-mediated signalling pathways which agrees with the results in this analysis and analysis from assignment 2. One difference is in the paper, there were pathways related to nuclear division that were enriched in AOSDMAS patients whereas my results did not show that.
Unfortunately, the paper did not report genes that were downregulated so I am not able to compare this portion of the results.
Yes, there has been many sources of evidence showing the role of the innate system activation during the pathogenesis of AOSD. One publication has shown that macrophages and neutrophils initate and facilitate inflammation in patients with AOSD. (wang_2019_pathogenesis?) Another group has also shown that the activated monocytes migrate and accumulate in tissues and differentiate into pro-inflammatory macrophages. (billiau_2005_macrophage?) These publications further support the findings in this assignment that the there is an increased immune response associated with activated monocytes in AOSD patients with and without MAS.
I chose to use the REACTOME_THE_NLRP3_INFLAMMASOME gene
set which can be downloaded here
This analysis was chosen because there were in-vitro experiments in the paper that provided evidence that the DNA-sensing pathway was upregulated in monocytes derived from AOSD-MAS and furthermore it was previously shown that NETs-DNA from AOSD patients activated macrophages and it was hypothesized to be via activation of the NLRP3 inflammasome.(Hu et al. 2019) After post-analysis, we do in fact see relationships between NLR3P gene set and gene sets related to inflammation which probably do lead to macrophage activation.
Figure 6. Post-analysis of
enrichment map using NLRP3 inflammasome gene set.